Sude Yılmaz

IE 582 Homework 2

Data Preprocessing

Task 1

Q1

Calculating:

Visualizing Final Result Probabilities for Matches Involving "Reims" Based on Calculations Above

Q2

Normalizing the probabilities by dividing each one by the sum of the three final result probabilities:
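A minimal sketch of this normalization step, assuming the three raw probabilities are stored in hypothetical columns `p_home`, `p_draw`, and `p_away` (the column names and the toy values are illustrative and not taken from the homework data):

```python
import pandas as pd

# Toy data; column names and values are assumptions for illustration only.
probs = pd.DataFrame({
    "p_home": [0.50, 0.40],
    "p_draw": [0.30, 0.30],
    "p_away": [0.25, 0.35],
})

cols = ["p_home", "p_draw", "p_away"]

# Row-wise sum of the three raw probabilities (may differ from 1).
total = probs[cols].sum(axis=1)

# Divide every probability by its row total so that each row sums to 1.
normalized = probs[cols].div(total, axis=0)
print(normalized)
```

After this step each row of `normalized` sums to 1, so the three values can be read as a proper probability distribution over the final result.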

Visualizing Final Result Normalized Probabilities for Matches Involving "Crystal Palace" Based on Calculations Above

Q3

The function plotter1:

It takes two inputs: the half-time and a flag indicating whether to use normalized or non-normalized probabilities for plotting. It performs the following tasks:

The graphs above show that:

The function plotter2:

It takes two inputs: the half-time and a flag indicating whether to use normalized or non-normalized probabilities for plotting.

The key difference compared to the previous plots is that these graphs are color-mapped based on the actual outcomes of the games.

This approach provides a clear visual distinction between tie and non-tie outcomes.

The graphs above show that:

Determining the Bins:

The code below:

The above results show that:

From the histogram above, a slight bias towards the left can be observed. This indicates that P(home win) - P(away win) > 0 is the more likely case, since the home team is expected to have a higher probability of winning than the away team.
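The binning behind such a histogram can be sketched as follows. This is only an illustration: the column names `p_home` and `p_away`, the toy values, and the choice of ten equal-width bins on [-1, 1] are assumptions, not the bins actually used in the homework.

```python
import numpy as np
import pandas as pd

# Toy data; names and values are assumptions for illustration only.
df = pd.DataFrame({
    "p_home": [0.60, 0.45, 0.30, 0.20, 0.50],
    "p_away": [0.15, 0.30, 0.40, 0.55, 0.25],
})

# Difference between home-win and away-win probabilities, in [-1, 1].
df["diff"] = df["p_home"] - df["p_away"]

# Assumed binning: 10 equal-width bins covering the full range [-1, 1].
bin_edges = np.linspace(-1.0, 1.0, 11)
df["bin"] = pd.cut(df["diff"], bins=bin_edges, include_lowest=True)

# Number of matches per bin, kept in bin order (empty bins are included).
counts = df["bin"].value_counts(sort=False)
print(counts)
```

Counting how many observations fall on each side of zero gives a quick numerical check of the asymmetry that the histogram shows visually.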

Task 2

1) Filter out the matches with NA values in the columns 'Goals - home' and 'Goals - away'

2) Create another column, "score_diff", to store the score difference (Home Team minus Away Team); this value is negative when the Away team is currently winning
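The two preprocessing steps above could look roughly like this. The columns 'Goals - home', 'Goals - away', `fixture_id`, and `score_diff` are named in the text; the toy rows, and the choice to drop individual rows with missing goals, are assumptions made for the sketch.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "fixture_id": [1, 1, 2, 3],
    "Goals - home": [1, 2, None, 0],
    "Goals - away": [0, 2, 1, 1],
})

goal_cols = ["Goals - home", "Goals - away"]

# Step 1: drop rows where either goal count is missing (assumed interpretation).
clean = df.dropna(subset=goal_cols).copy()

# Step 2: home goals minus away goals; negative when the away team leads.
clean["score_diff"] = clean["Goals - home"] - clean["Goals - away"]
print(clean)
```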

The function filter_out_matches:

Filters out rows from the DataFrame based on the given conditions. If any row satisfies the rule, all rows with the same fixture_id are excluded.

Inputs:

Returns:

Selected Rules:

1) filter1:

if away team gets a Redcard in the first 15 minutes of the game.

2) filter2:

if home team gets a Redcard in the first 15 minutes of the game.

3) filter3:

if away team gets a Yellowredcard in the first 15 minutes of the game.

4) filter4:

if home team gets a Yellowredcard in the first 15 minutes of the game.

5) filter5:

if away team gets Penalties in the first 15 minutes of the game.

6) filter6:

if home team gets Penalties in the first 15 minutes of the game.

7) filter7:

when score_diff is 1, meaning the Home team is ahead by one goal, and the Away team scores a goal in the last 5 minutes of the game.

8) filter8:

when score_diff is 0, meaning the game is tied, and the Away team scores a goal in the last 5 minutes of the game.

9) filter9:

when score_diff is -1, meaning the Away team is ahead by one goal, and the Home team scores a goal in the last 5 minutes of the game.

10) filter10:

when score_diff is 0, meaning the game is tied, and the Home team scores a goal in the last 5 minutes of the game.
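The behavior described for filter_out_matches (if any row of a match satisfies a rule, every row with the same fixture_id is dropped) can be sketched as below. The helper name `filter_out_matches_sketch`, its signature, and the columns `minute` and `redcards_away` are hypothetical; the original function's inputs are not shown here, so this is a reconstruction under assumptions rather than the actual implementation.

```python
import pandas as pd

def filter_out_matches_sketch(df: pd.DataFrame, mask: pd.Series) -> pd.DataFrame:
    """Drop all rows of every fixture that has at least one row where mask is True."""
    # Fixtures in which the rule fires at least once.
    flagged_ids = df.loc[mask, "fixture_id"].unique()
    # Keep only rows belonging to fixtures that were never flagged.
    return df[~df["fixture_id"].isin(flagged_ids)]

# Toy data; column names other than fixture_id are assumptions.
df = pd.DataFrame({
    "fixture_id": [1, 1, 2, 2],
    "minute": [10, 80, 12, 85],
    "redcards_away": [1, 1, 0, 0],
})

# A rule in the spirit of filter1: away red card within the first 15 minutes.
rule = (df["minute"] <= 15) & (df["redcards_away"] > 0)
kept = filter_out_matches_sketch(df, rule)
print(kept)
```

Passing the rule as a boolean mask keeps the helper generic: each of the ten rules only needs its own mask, and the masks can be combined with `|` to apply several rules in one pass.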

A total of 5636 matches are removed after applying the filters. The last four filters in particular flag a large number of "abnormal" games, so the filtering might make a difference in the results.

Again, the histogram behaves similarly to the one obtained before filtering.

Task 3

First, I removed the features that are unrelated to the outcome of the game, namely:

I then created a new feature, time_in_minutes, which represents the minute of the game while accounting for half-time information. This feature can be beneficial for the decision tree, as the remaining time in the game significantly impacts the probability of winning or losing.
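One way such a feature could be built is sketched below. The column names `halftime` and `minute`, the labels "1st-half" and "2nd-half", and the assumption that the per-half clock restarts at 0 after the break are all illustrative; only the name `time_in_minutes` comes from the text.

```python
import pandas as pd

# Toy data; column names and half labels are assumptions.
df = pd.DataFrame({
    "halftime": ["1st-half", "1st-half", "2nd-half", "2nd-half"],
    "minute": [5, 44, 0, 30],
})

# Assumed convention: the clock restarts in the second half, so add 45 minutes.
offset = df["halftime"].map({"1st-half": 0, "2nd-half": 45})

# Minute of the game measured from kick-off.
df["time_in_minutes"] = df["minute"] + offset
print(df)
```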

The first tree achieves a prediction accuracy of 0.6586, which is considered moderate performance. This can be improved by adding new variables or adjusting the parameters of the decision tree.
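As an illustration of how changing tree parameters affects accuracy and feature importance, the sketch below fits two scikit-learn decision trees on synthetic three-class data. The data, the parameter values (`max_depth`, `min_samples_leaf`), and the train/test split are assumptions chosen for the example; they do not reproduce the homework's trees or its reported accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match data: three classes, eight features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

accuracy = {}
importances = {}
for depth in (3, 6):
    # Two trees that differ only in their maximum depth (illustrative values).
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=20, random_state=0)
    tree.fit(X_train, y_train)
    accuracy[depth] = accuracy_score(y_test, tree.predict(X_test))
    # Impurity-based importances; they sum to 1 for a fitted tree with splits.
    importances[depth] = tree.feature_importances_

for depth in accuracy:
    print(depth, round(accuracy[depth], 4), importances[depth].round(3))
```

Comparing the two importance vectors side by side shows whether the deeper tree relies on the same features as the shallow one or spreads its splits across more of them.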

The performance of the tree improved slightly but is still not high.

To Compare Feature Importance for Two Decision Trees:

Feature Importance:

Performance Comparison:

Key Differences and Insights:

Conclusion: